Word List (in order of appearance)
# | word | phonetic | sentence |
1 | MultiBox | |
- SSD: Single Shot MultiBox Detector SSD:单发多盒检测器
- Some version of this is also required for training in YOLO[5] and for the region proposal stage of Faster R-CNN[2] and MultiBox[7]. 在YOLO[5]的训练以及Faster R-CNN[2]和MultiBox[7]的区域提出阶段中,也需要某种形式的这种操作。
- We begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]). 我们首先将每个实际边界框与具有最佳Jaccard重叠的默认边界框相匹配(如MultiBox[7]中那样)。
- Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). 与MultiBox不同的是,我们将默认边界框匹配到Jaccard重叠高于阈值(0.5)的任何实际边界框。
- The SSD training objective is derived from the MultiBox objective[7,8] but is extended to handle multiple object categories. SSD训练目标函数来自于MultiBox目标[7,8],但扩展到处理多个目标类别。
- Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness. Fast R-CNN[6]扩展了SPPnet,使得它可以通过最小化置信度和边界框回归的损失来对所有层进行端到端的微调,这种损失最初在MultiBox[7]中被引入用于学习目标性(objectness)。
- In the most recent works like MultiBox [7,8], the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. 在最近的工作MultiBox[7,8]中,基于低级图像特征的选择性搜索区域提出直接被单独的深度神经网络生成的提出所取代。
|
2 | discretize | [dɪˈskriːtaɪz] |
- Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. 我们的方法命名为SSD,它将边界框的输出空间离散化为在每个特征映射位置上具有不同长宽比和尺度的一组默认框。
- Allowing different default box shapes in several feature maps let us efficiently discretize the space of possible output box shapes. 在几个特征映射中允许不同的默认边界框形状,使我们能够有效地离散化可能的输出框形状的空间。
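To make the discretization concrete, below is a minimal numpy sketch (our own illustration, not the authors' code) that generates center-form default boxes for one feature map, using the paper's parameterization $w=s\sqrt{a}$, $h=s/\sqrt{a}$ for scale $s$ and aspect ratio $a$:

```python
import numpy as np

def default_boxes(m, n, scale, aspect_ratios):
    """Center-form default boxes (cx, cy, w, h) for an m x n feature map,
    with coordinates normalized to [0, 1]."""
    boxes = []
    for i in range(m):
        for j in range(n):
            cx, cy = (j + 0.5) / n, (i + 0.5) / m  # cell center
            for a in aspect_ratios:
                w, h = scale * np.sqrt(a), scale / np.sqrt(a)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)

# e.g. an 8 x 8 map with 4 boxes per location -> 256 default boxes
print(default_boxes(8, 8, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5, 3.0]).shape)
```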
|
3 | bounding | [ˈbaʊndɪŋ] |
- Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. 我们的方法命名为SSD,它将边界框的输出空间离散化为在每个特征映射位置上具有不同长宽比和尺度的一组默认框。
- Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. 目前最先进的目标检测系统是以下方法的变种:假设边界框,为每个框重采样像素或特征,并应用一个高质量的分类器。
- This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. 本文提出了第一个基于深度网络的目标检测器,它不对边界框假设的像素或特征进行重采样,并且与这样做的方法同样精确。
- The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. 速度的根本改进来自消除边界框提出和随后的像素或特征重采样阶段。
- Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. 我们的改进包括使用小型卷积滤波器来预测目标类别和边界框位置的偏移量,对不同长宽比的检测使用单独的预测器(滤波器),并将这些滤波器应用于网络后期的多个特征映射,以执行多尺度检测。
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. SSD的核心是使用应用于特征映射的小型卷积滤波器,来预测一组固定的默认边界框的类别分数和边界框偏移。
- The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. SSD方法基于前馈卷积网络,该网络产生固定大小的边界框集合,并对这些边界框中存在的目标类别实例进行评分,然后进行非极大值抑制步骤来产生最终的检测结果。
- The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf the architecture of YOLO[5] that uses an intermediate fully connected layer instead of a convolutional filter for this step). 边界框偏移输出值是相对于每个特征映射位置上的默认框位置来度量的(查阅YOLO[5]的架构,该步骤使用中间全连接层而不是卷积滤波器)。
- Default boxes and aspect ratios We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. 默认边界框和长宽比。对于网络顶部的多个特征映射,我们将一组默认边界框与每个特征映射单元相关联。
- Similar to Faster R-CNN[2], we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). 类似于Faster R-CNN[2],我们回归默认边界框(d)的中心偏移量(cx, cy)和其宽度(w)、高度(h)的偏移量。
- Figure 4 shows that SSD is very sensitive to the bounding box size. 图4显示SSD对边界框大小非常敏感。
- Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness. Fast R-CNN[6]扩展了SPPnet,使得它可以通过最小化置信度和边界框回归的损失来对所有层进行端到端的微调,这种损失最初在MultiBox[7]中被引入用于学习目标性(objectness)。
- Another set of methods, which are directly related to our approach, skip the proposal step altogether and predict bounding boxes and confidences for multiple categories directly. 与我们的方法直接相关的另一组方法,完全跳过提出步骤,直接预测多个类别的边界框和置信度。
- OverFeat [4], a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. OverFeat[4]是滑动窗口方法的深度版本,在知道了底层目标类别的置信度之后,直接从最顶层的特征映射的每个位置预测边界框。
- YOLO [5] uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). YOLO[5]使用整个最顶层的特征映射来预测多个类别和边界框(这些类别共享)的置信度。
- A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. 我们模型的一个关键特性是使用网络顶部多个特征映射的多尺度卷积边界框输出。
- We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. 我们通过实验验证,在给定合适训练策略的情况下,大量仔细选择的默认边界框会提高性能。
|
4 | Additionally | [ə'dɪʃənəlɪ] |
- Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. 此外,网络还结合了不同分辨率的多个特征映射的预测,自然地处理各种尺寸的目标。
|
5 | encapsulate | [ɪnˈkæpsjuleɪt] |
- SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. 相对于需要目标提出的方法,SSD非常简单,因为它完全消除了提出生成和随后的像素或特征重新采样阶段,并将所有计算封装到单个网络中。
|
6 | Pascal | [ˈpæskəl] |
- Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. PASCAL VOC、COCO和ILSVRC数据集上的实验结果证实,SSD与利用额外目标提出步骤的方法相比具有竞争力的准确性,并且速度更快,同时为训练和推断提供了统一的框架。
- This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. 自选择性搜索[1]的工作以来,这种流程就在检测基准上盛行,目前在PASCAL VOC、COCO和ILSVRC检测上领先的结果都基于Faster R-CNN[2](尽管使用了如[3]那样更深的特征)。
- While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from $63.4\%$ mAP for YOLO to $74.3\%$ mAP for our SSD. 虽然这些贡献可能单独看起来很小,但是我们注意到由此产生的系统将PASCAL VOC实时检测的准确度从YOLO的63.4%的mAP提高到我们的SSD的74.3%的mAP。
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches. 实验包括在PASCAL VOC,COCO和ILSVRC上评估具有不同输入大小的模型的时间和精度分析,并与最近的一系列最新方法进行比较。
- 3.1 PASCAL VOC2007 3.1 PASCAL VOC2007
- Table 1: PASCAL VOC2007 test detection results. 表1:PASCAL VOC2007 test检测结果。
- 3.3 PASCAL VOC2012 3.3 PASCAL VOC2012
- Table 4: PASCAL VOC2012 test detection results. 表4:PASCAL VOC2012 test检测结果。Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。
- Since objects in COCO tend to be smaller than PASCAL VOC, we use smaller default boxes for all layers. 由于COCO中的目标往往比PASCAL VOC中的更小,因此我们对所有层使用较小的默认边界框。
- Similar to what we observed on the PASCAL VOC dataset, SSD300 is better than Fast R-CNN in both mAP@0.5 and mAP@[0.5:0.95]. 与我们在PASCAL VOC数据集中观察到的结果类似,SSD300在mAP@0.5和mAP@[0.5:0.95]中都优于Fast R-CNN。
- The data augmentation strategy described in Sec. 2.2 helps to improve the performance dramatically, especially on small datasets such as PASCAL VOC. 2.2节中描述的数据增强策略有助于显著提高性能,特别是在PASCAL VOC等小数据集上。
- Table 7: Results on Pascal VOC2007 test. 表7:Pascal VOC2007 test上的结果。
- Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. 在PASCAL VOC和COCO上,我们的SSD512模型的性能明显优于最先进的Faster R-CNN[2],而速度提高了3倍。
|
7 | ILSVRC | [!≈ aɪ el es vi: ɑ:(r) si:] |
- Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. PASCAL VOC、COCO和ILSVRC数据集上的实验结果证实,SSD与利用额外目标提出步骤的方法相比具有竞争力的准确性,并且速度更快,同时为训练和推断提供了统一的框架。
- This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. 自选择性搜索[1]的工作以来,这种流程就在检测基准上盛行,目前在PASCAL VOC、COCO和ILSVRC检测上领先的结果都基于Faster R-CNN[2](尽管使用了如[3]那样更深的特征)。
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches. 实验包括在PASCAL VOC,COCO和ILSVRC上评估具有不同输入大小的模型的时间和精度分析,并与最近的一系列最新方法进行比较。
- Base network Our experiments are all based on VGG16[15], which is pre-trained on the ILSVRC CLS-LOC dataset[16]. 基础网络。我们的实验全部基于VGG16[15],它是在ILSVRC CLS-LOC数据集[16]上预先训练的。
- 3.5 Preliminary ILSVRC results 3.5 初步的ILSVRC结果
- We applied the same network architecture we used for COCO to the ILSVRC DET dataset [16]. 我们将在COCO上应用的相同网络架构应用于ILSVRC DET数据集[16]。
|
8 | VOC2007 | |
- For 300 × 300 input, SSD achieves $74.3\%$ mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves $76.9\%$ mAP, outperforming a comparable state-of-the-art Faster R-CNN model. 对于300×300的输入,SSD在VOC2007测试中以59FPS的速度在Nvidia Titan X上达到$74.3\%$的mAP,对于512×512的输入,SSD达到了$76.9\%$的mAP,优于参照的最先进的Faster R-CNN模型。
- This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP $74.3\%$ on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP $73.2\%$ or YOLO 45 FPS with mAP $63.4\%$). 这显著提高了高精度检测的速度(在VOC2007 test上为59 FPS、74.3%的mAP,相比之下Faster R-CNN为7 FPS、73.2%的mAP,YOLO为45 FPS、63.4%的mAP)。
- SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed. 输入尺寸为300×300的SSD在VOC2007 test上的准确度明显优于448×448的YOLO,同时还提高了速度。
- 3.1 PASCAL VOC2007 3.1 PASCAL VOC2007
- On this dataset, we compare against Fast R-CNN [6] and Faster R-CNN [2] on VOC2007 test (4952 images). 在这个数据集上,我们在VOC2007 test(4952张图像)上与Fast R-CNN[6]和Faster R-CNN[2]进行了比较。
- When training on VOC2007 $\texttt{trainval}$, Table 1 shows that our low resolution SSD300 model is already more accurate than Fast R-CNN. 当对VOC2007 $\texttt{trainval}$进行训练时,表1显示了我们的低分辨率SSD300模型已经比Fast R-CNN更准确。
- Table 1: PASCAL VOC2007 test detection results. 表1:PASCAL VOC2007 test检测结果。
- Data: “07”: VOC2007 trainval, “07+12”: union of VOC2007 and VOC2012 trainval. “07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12. 数据:“07”:VOC2007 trainval;“07+12”:VOC2007和VOC2012 trainval的并集;“07+12+COCO”:先在COCO trainval35k上训练,然后在07+12上微调。
- Fig. 3: Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test. 图3:SSD512在VOC2007 test中的动物,车辆和家具上的性能可视化。
- Fig. 4: Sensitivity and impact of different object characteristics on VOC2007 test set using [21]. 图4:使用[21]在VOC2007 test数据集上不同目标特性的灵敏度和影响。
- We use the same settings as those used for our basic VOC2007 experiments above, except that we use VOC2012 trainval and VOC2007 trainval and test (21503 images) for training, and test on VOC2012 test (10991 images). 除了我们使用VOC2012 trainval和VOC2007 trainval,test(21503张图像)进行训练,以及在VOC2012 test(10991张图像)上进行测试之外,我们使用与上述基本的VOC2007实验相同的设置。
- We see the same performance trend as we observed on VOC2007 test. 我们看到了与我们在VOC2007 test中观察到的相同的性能趋势。
- Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is 448 × 448. data: “07++12”: union of VOC2007 trainval and test and VOC2012 trainval. Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。数据:“07++12”:VOC2007 trainval、test与VOC2012 trainval的并集。
- Fig. 6: Sensitivity and impact of object size with new data augmentation on VOC2007 test set using [21]. 图6:使用[21]在VOC2007 test数据集上,采用新的数据增强后目标尺寸的灵敏度及影响。
- Table 7: Results on Pascal VOC2007 test. 表7:Pascal VOC2007 test上的结果。
|
9 | FPS | [!≈ ef piː es] |
- For 300 × 300 input, SSD achieves $74.3\%$ mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves $76.9\%$ mAP, outperforming a comparable state-of-the-art Faster R-CNN model. 对于300×300的输入,SSD在VOC2007测试中以59FPS的速度在Nvidia Titan X上达到$74.3\%$的mAP,对于512×512的输入,SSD达到了$76.9\%$的mAP,优于参照的最先进的Faster R-CNN模型。
- Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). 通常,这些方法的检测速度是以每帧秒(SPF)度量,甚至最快的高精度检测器,Faster R-CNN,仅以每秒7帧(FPS)的速度运行。
- This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP $74.3\%$ on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP $73.2\%$ or YOLO 45 FPS with mAP $63.4\%$). 这显著提高了高精度检测的速度(在VOC2007 test上为59 FPS、74.3%的mAP,相比之下Faster R-CNN为7 FPS、73.2%的mAP,YOLO为45 FPS、63.4%的mAP)。
- Although Fast YOLO[5] can run at 155 FPS, it has lower accuracy by almost $22\%$ mAP. 虽然Fast YOLO[5]可以以155FPS的速度运行,但其准确性却降低了近22%的mAP。
- Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy. 我们的实时SSD300模型运行速度为59FPS,比目前的实时YOLO[5]更快,同时显著提高了检测精度。
|
10 | Nvidia | [ɪn'vɪdɪə] |
- For 300 × 300 input, SSD achieves $74.3\%$ mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves $76.9\%$ mAP, outperforming a comparable state-of-the-art Faster R-CNN model. 对于300×300的输入,SSD在VOC2007测试中以59FPS的速度在Nvidia Titan X上达到$74.3\%$的mAP,对于512×512的输入,SSD达到了$76.9\%$的mAP,优于参照的最先进的Faster R-CNN模型。
- We thank NVIDIA for providing GPUs and acknowledge support from NSF 1452851, 1446631, 1526367, 1533771. 我们感谢NVIDIA提供的GPU,并对NSF 1452851,1446631,1526367,1533771的支持表示感谢。
|
11 | Titan | [ˈtaɪtn] |
- For 300 × 300 input, SSD achieves $74.3\%$ mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves $76.9\%$ mAP, outperforming a comparable state-of-the-art Faster R-CNN model. 对于300×300的输入,SSD在VOC2007测试中以59FPS的速度在Nvidia Titan X上达到$74.3\%$的mAP,对于512×512的输入,SSD达到了$76.9\%$的mAP,优于参照的最先进的Faster R-CNN模型。
- We measure the speed with batch size 8 using Titan X and cuDNN v4 with Intel Xeon E5-2667v3@3.20GHz. 我们使用Titan X、cuDNN v4、Intel Xeon E5-2667v3@3.20GHz以及批大小为8来测量速度。
|
12 | comparable | [ˈkɒmpərəbl] |
- For 300 × 300 input, SSD achieves $74.3\%$ mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves $76.9\%$ mAP, outperforming a comparable state-of-the-art Faster R-CNN model. 对于300×300的输入,SSD在VOC2007测试中以59FPS的速度在Nvidia Titan X上达到$74.3\%$的mAP,对于512×512的输入,SSD达到了$76.9\%$的mAP,优于参照的最先进的Faster R-CNN模型。
- The SSD architecture combines predictions from feature maps of various resolutions to achieve comparable accuracy to Faster R-CNN, while using lower resolution input images. SSD架构将来自各种分辨率的特征映射的预测结合起来,以达到与Faster R-CNN相当的精确度,同时使用较低分辨率的输入图像。
- Before the advent of convolutional neural networks, the state of the art for those two approaches, Deformable Part Model (DPM) [26] and Selective Search [1], had comparable performance. 在卷积神经网络出现之前,这两种方法的最新技术——可变形部件模型(DPM)[26]和选择性搜索[1]——具有相当的性能。
|
13 | variant | [ˈveəriənt] |
- Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. 目前最先进的目标检测系统是以下方法的变种:假设边界框,为每个框重采样像素或特征,并应用一个高质量的分类器。
|
14 | hypothesize | [haɪˈpɒθəsaɪz] |
- Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. 目前最先进的目标检测系统是以下方法的变种:假设边界框,为每个框重采样像素或特征,并应用一个高质量的分类器。
|
15 | prevail | [prɪˈveɪl] |
- This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. 自选择性搜索[1]的工作以来,这种流程就在检测基准上盛行,目前在PASCAL VOC、COCO和ILSVRC检测上领先的结果都基于Faster R-CNN[2](尽管使用了如[3]那样更深的特征)。
|
16 | selective | [sɪˈlektɪv] |
- This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. 自选择性搜索[1]的工作以来,这种流程就在检测基准上盛行,目前在PASCAL VOC、COCO和ILSVRC检测上领先的结果都基于Faster R-CNN[2](尽管使用了如[3]那样更深的特征)。
- Before the advent of convolutional neural networks, the state of the art for those two approaches, Deformable Part Model (DPM) [26] and Selective Search [1], had comparable performance. 在卷积神经网络出现之前,这两种方法的最新技术——可变形部件模型(DPM)[26]和选择性搜索[1]——具有相当的性能。
- However, after the dramatic improvement brought on by R-CNN [22], which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent. 然而,在R-CNN[22](它结合了选择性搜索区域提出和基于卷积网络的后分类)带来显著改进之后,区域提出目标检测方法开始盛行。
- In the most recent works like MultiBox [7,8], the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. 在最近的工作MultiBox[7,8]中,基于低级图像特征的选择性搜索区域提出直接被单独的深度神经网络生成的提出所取代。
- Faster R-CNN [2] replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between fine-tuning shared convolutional layers and prediction layers for these two networks. Faster R-CNN[2]将选择性搜索提出替换为从区域提出网络(RPN)学习到的提出,并引入了一种方法,通过在微调这两个网络的共享卷积层和预测层之间交替,将RPN与Fast R-CNN结合起来。
|
17 | albeit | [ˌɔ:lˈbi:ɪt] |
- This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN[2] albeit with deeper features such as [3]. 自选择性搜索[1]的工作以来,这种流程就在检测基准上盛行,目前在PASCAL VOC、COCO和ILSVRC检测上领先的结果都基于Faster R-CNN[2](尽管使用了如[3]那样更深的特征)。
|
18 | computationally | [!≈ ˌkɒmpjuˈteɪʃənli] |
- While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. 尽管这些方法准确,但对于嵌入式系统而言,这些方法的计算量过大,即使是高端硬件,对于实时应用而言也太慢。
|
19 | SPF | [!≈ es piː ef] |
- Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). 通常,这些方法的检测速度是以每帧秒(SPF)度量,甚至最快的高精度检测器,Faster R-CNN,仅以每秒7帧(FPS)的速度运行。
|
20 | YOLO | [!≈ wai əu el əu] |
- This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP $74.3\%$ on VOC2007 test, vs. Faster R-CNN 7 FPS with mAP $73.2\%$ or YOLO 45 FPS with mAP $63.4\%$). 这显著提高了高精度检测的速度(在VOC2007 test上为59 FPS、74.3%的mAP,相比之下Faster R-CNN为7 FPS、73.2%的mAP,YOLO为45 FPS、63.4%的mAP)。
- While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from $63.4\%$ mAP for YOLO to $74.3\%$ mAP for our SSD. 虽然这些贡献可能单独看起来很小,但是我们注意到由此产生的系统将PASCAL VOC实时检测的准确度从YOLO的63.4%的mAP提高到我们的SSD的74.3%的mAP。
- We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate. 我们引入了SSD,这是一种针对多个类别的单次检测器,它比之前最先进的单次检测器(YOLO)更快,并且准确得多。
- The convolutional model for predicting detections is different for each feature layer (cf Overfeat[4] and YOLO[5] that operate on a single scale feature map). 用于预测检测的卷积模型对于每个特征层都是不同的(查阅Overfeat[4]和YOLO[5]在单尺度特征映射上的操作)。
- The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf the architecture of YOLO[5] that uses an intermediate fully connected layer instead of a convolutional filter for this step). 边界框偏移输出值是相对于每个特征映射位置上的默认框位置来度量的(查阅YOLO[5]的架构,该步骤使用中间全连接层而不是卷积滤波器)。
- Fig. 2: A comparison between two single shot detection models: SSD and YOLO [5]. 图2:两个单次检测模型的比较:SSD和YOLO[5]。
- SSD with a 300 × 300 input size significantly outperforms its 448 × 448 YOLO counterpart in accuracy on VOC2007 test while also improving the speed. 输入尺寸为300×300的SSD在VOC2007 test上的准确度明显优于448×448的YOLO,同时还提高了速度。
- Some version of this is also required for training in YOLO[5] and for the region proposal stage of Faster R-CNN[2] and MultiBox[7]. 在YOLO[5]的训练以及Faster R-CNN[2]和MultiBox[7]的区域提出阶段中,也需要某种形式的这种操作。
- We use a more extensive sampling strategy, similar to YOLO [5]. 我们使用更广泛的抽样策略,类似于YOLO[5]。
- Compared to YOLO, SSD is significantly more accurate, likely due to the use of convolutional default boxes from multiple feature maps and our matching strategy during training. 与YOLO相比,SSD更精确,可能是由于使用了来自多个特征映射的卷积默认边界框和我们在训练期间的匹配策略。
- Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is 448 × 448. data: “07++12”: union of VOC2007 trainval and test and VOC2012 trainval. Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。数据:“07++12”:VOC2007 trainval、test与VOC2012 trainval的并集。
- Table 7 shows the comparison between SSD, Faster R-CNN[2], and YOLO[5]. 表7显示了SSD,Faster R-CNN[2]和YOLO[5]之间的比较。
- Although Fast YOLO[5] can run at 155 FPS, it has lower accuracy by almost $22\%$ mAP. 虽然Fast YOLO[5]可以以155FPS的速度运行,但其准确性却降低了近22%的mAP。
- YOLO [5] uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). YOLO[5]使用整个最顶层的特征映射来预测多个类别和边界框(这些类别共享)的置信度。
- If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat [4]; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5]. 如果我们只从最顶层的特征映射的每个位置使用一个默认框,我们的SSD将具有与OverFeat[4]相似的架构;如果我们使用整个最顶层的特征映射,并添加一个全连接层进行预测来代替我们的卷积预测器,并且没有明确地考虑多个长宽比,我们可以近似地再现YOLO[5]。
- Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy. 我们的实时SSD300模型运行速度为59FPS,比目前的实时YOLO[5]更快,同时显著提高了检测精度。
|
21 | cf | |
- We are not the first to do this (cf [4,5]), but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. 我们并不是第一个这样做的人(查阅[4,5]),但是通过增加一系列改进,我们设法比以前的尝试显著提高了准确性。
- The convolutional model for predicting detections is different for each feature layer (cf Overfeat[4] and YOLO[5] that operate on a single scale feature map). 用于预测检测的卷积模型对于每个特征层都是不同的(查阅Overfeat[4]和YOLO[5]在单尺度特征映射上的操作)。
- The bounding box offset output values are measured relative to a default box position relative to each feature map location (cf the architecture of YOLO[5] that uses an intermediate fully connected layer instead of a convolutional filter for this step). 边界框偏移输出值是相对每个特征映射位置的相对默认框位置来度量的(查阅YOLO[5]的架构,该步骤使用中间全连接层而不是卷积滤波器)。
|
22 | predictor | [prɪˈdɪktə(r)] |
- Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. 我们的改进包括使用小型卷积滤波器来预测目标类别和边界框位置的偏移量,对不同长宽比的检测使用单独的预测器(滤波器),并将这些滤波器应用于网络后期的多个特征映射,以执行多尺度检测。
- Convolutional predictors for detection Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. 用于检测的卷积预测器。每个添加的特征层(或者可选地,来自基础网络的现有特征层)可以使用一组卷积滤波器产生一组固定的检测预测。
- If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat [4]; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5]. 如果我们只从最顶层的特征映射的每个位置使用一个默认框,我们的SSD将具有与OverFeat[4]相似的架构;如果我们使用整个最顶层的特征映射,并添加一个全连接层进行预测来代替我们的卷积预测器,并且没有明确地考虑多个长宽比,我们可以近似地再现YOLO[5]。
|
23 | residual | [rɪˈzɪdjuəl] |
- This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks [3]. 相比于最近备受瞩目的残差网络方面的工作[3],在检测精度上这是相对更大的提高。
|
24 | trade-off | [ˈtreɪdˌɔ:f, -ˌɔf] |
- These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off. 这些设计特性使得即使在低分辨率输入图像上也能实现简单的端到端训练和高精度,从而进一步改善了速度与精度的权衡。
|
25 | dataset-specific | [!≈ 'deɪtəset spəˈsɪfɪk] |
- Afterwards, Sec. 2.3 presents dataset-specific model details and experimental results. 之后,2.3节介绍了数据集特有的模型细节和实验结果。
|
26 | feed-forward | ['fi:df'ɔ:wəd] |
- The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. SSD方法基于前馈卷积网络,该网络产生固定大小的边界框集合,并对这些边界框中存在的目标类别实例进行评分,然后进行非极大值抑制步骤来产生最终的检测结果。
|
27 | suppression | [səˈpreʃn] |
- The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. SSD方法基于前馈卷积网络,该网络产生固定大小的边界框集合,并对这些边界框中存在的目标类别实例进行评分,然后进行非极大值抑制步骤来产生最终的检测结果。
- Considering the large number of boxes generated from our method, it is essential to perform non-maximum suppression (nms) efficiently during inference. 考虑到我们的方法产生大量的边界框,在推断期间高效地执行非极大值抑制(nms)是必要的。
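As a sketch of what this step computes, here is a greedy per-class NMS in numpy, assuming corner-form boxes and the thresholds quoted elsewhere in the paper (0.45 overlap, top 200 detections); the released Caffe implementation differs in detail:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Greedy non-maximum suppression; boxes are (xmin, ymin, xmax, ymax)."""
    order = scores.argsort()[::-1][:top_k]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the best remaining box and all the others
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep
```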
|
28 | truncated | ['trʌŋkeɪtɪd] |
- The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. 早期的网络层基于用于高质量图像分类的标准架构(在任何分类层之前被截断),我们将其称为基础网络。
- Multi-scale feature maps for detection We add convolutional feature layers to the end of the truncated base network. 用于检测的多尺度特征映射。我们将卷积特征层添加到截取的基础网络的末端。
|
29 | auxiliary | [ɔ:gˈzɪliəri] |
- We then add auxiliary structure to the network to produce detections with the following key features: 然后,我们将辅助结构添加到网络中以产生具有以下关键特征的检测:
|
30 | progressively | [prəˈgresɪvli] |
- These layers decrease in size progressively and allow predictions of detections at multiple scales. 这些层在尺寸上逐渐减小,并允许在多个尺度上对检测结果进行预测。
- To measure the advantage gained, we progressively remove layers and compare results. 为了衡量所获得的优势,我们逐步删除层并比较结果。
|
31 | Overfeat | |
- The convolutional model for predicting detections is different for each feature layer (cf Overfeat[4] and YOLO[5] that operate on a single scale feature map). 用于预测检测的卷积模型对于每个特征层都是不同的(查阅Overfeat[4]和YOLO[5]在单尺度特征映射上的操作)。
- OverFeat [4], a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. OverFeat[4]是滑动窗口方法的深度版本,在知道了底层目标类别的置信度之后,直接从最顶层的特征映射的每个位置预测边界框。
- If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat [4]; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5]. 如果我们只从最顶层的特征映射的每个位置使用一个默认框,我们的SSD将具有与OverFeat[4]相似的架构;如果我们使用整个最顶层的特征映射,并添加一个全连接层进行预测来代替我们的卷积预测器,并且没有明确地考虑多个长宽比,我们可以近似地再现YOLO[5]。
|
32 | optionally | ['ɒpʃənəlɪ] |
- Convolutional predictors for detection Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. 用于检测的卷积预测器。每个添加的特征层(或者可选地,来自基础网络的现有特征层)可以使用一组卷积滤波器产生一组固定的检测预测。
|
33 | tile | [taɪl] |
- The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. 默认边界框以卷积的方式平铺特征映射,以便每个边界框相对于其对应单元的位置是固定的。
- We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. 我们设计默认边界框的平铺方式,以便特定的特征映射学习响应目标的特定尺度。
- How to design the optimal tiling is an open question as well. 如何设计最佳平铺也是一个悬而未决的问题。
- For a fair comparison, every time we remove a layer, we adjust the default box tiling to keep the total number of boxes similar to the original (8732). 为了公平比较,每次删除一层时,我们都会调整默认边界框的平铺,使边界框总数保持与原来(8732个)相近。
- We do not exhaustively optimize the tiling for each setting. 我们没有详尽地优化每个设置的平铺。
- An alternative way of improving SSD is to design a better tiling of default boxes so that its position and scale are better aligned with the receptive field of each position on a feature map. 改进SSD的另一种方法是设计一个更好的默认边界框平铺,使其位置和尺度与特征映射上每个位置的感受野更好地对齐。
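For reference, the paper spaces the per-layer default box scales linearly between $s_{\min}=0.2$ and $s_{\max}=0.9$; a one-function sketch of that rule:

```python
def layer_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced default-box scales, one per feature map."""
    return [s_min + (s_max - s_min) * (k - 1) / (num_maps - 1)
            for k in range(1, num_maps + 1)]

print(layer_scales(6))  # ~[0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```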
|
34 | kmn | |
- This results in a total of $(c+4)k$ filters that are applied around each location in the feature map, yielding $(c+4)kmn$ outputs for an $m\times n$ feature map. 这样,在特征映射中的每个位置周围总共应用$(c+4)k$个滤波器,对$m\times n$的特征映射产生$(c+4)kmn$个输出。
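A worked instance of this count, with illustrative SSD300-like numbers (c = 21 for the 20 VOC classes plus background, k = 4 default boxes, and a 38×38 feature map; the exact values are our assumption for the example):

```python
c, k, m, n = 21, 4, 38, 38     # classes (incl. background), boxes per cell, map size
filters = (c + 4) * k          # (c+4)k = 100 small convolutional filters
outputs = filters * m * n      # (c+4)kmn = 144400 output values for this map
print(filters, outputs)
```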
|
35 | Fig.1. | |
- For an illustration of default boxes, please refer to Fig.1. 有关默认边界框的说明,请参见图1。
|
36 | e.g. | [ˌi: ˈdʒi:] |
- (a) SSD only needs an input image and ground truth boxes for each object during training. In a convolutional fashion, we evaluate a small set (e.g. 4) of default boxes of different aspect ratios at each location in several feature maps with different scales (e.g. 8 × 8 and 4 × 4 in (b) and (c)). (a)SSD在训练期间仅需要输入图像和每个目标的实际边界框。我们以卷积方式,在具有不同尺度(例如(b)和(c)中的8×8和4×4)的几个特征映射的每个位置上,评估一小组(例如4个)不同长宽比的默认框。
- The model loss is a weighted sum between localization loss (e.g. Smooth L1 [6]) and confidence loss (e.g. Softmax). 模型损失是定位损失(例如,Smooth L1[6])和置信度损失(例如Softmax)之间的加权和。
- Increasing the input size (e.g. from 300 × 300 to 512 × 512) can help improve detecting small objects, but there is still a lot of room to improve. 增加输入尺寸(例如从300×300到512×512)可以帮助改进检测小目标,但仍然有很大的改进空间。
- For example, it hurts the performance by a large margin if we use very coarse feature maps (e.g. conv11_2 (1 × 1) or conv10_2 (3 × 3)). 例如,如果我们使用非常粗糙的特征映射(例如conv11_2(1×1)或conv10_2(3×3)),它会大大伤害性能。
- We follow the strategy mentioned in Sec. 2.2, but now our smallest default box has a scale of 0.15 instead of 0.2, and the scale of the default box on conv4_3 is 0.07 (e.g. 21 pixels for a 300 × 300 image). 我们遵循2.2节中提到的策略,但是现在我们最小的默认边界框尺度是0.15而不是0.2,并且conv4_3上的默认边界框尺度是0.07(例如,300×300图像中的21个像素)。
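Below is a minimal numpy sketch of the two loss terms and their weighted sum for a single matched default box (the paper sets the weight α to 1; this is our illustration, not the training code):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber) penalty used for the localization loss."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def softmax_confidence_loss(logits, label):
    """Cross-entropy over class scores for one default box."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

alpha = 1.0  # loss weight, set to 1 in the paper
loc = smooth_l1(np.array([0.3, -1.6, 0.2, 0.05])).sum()        # offset residuals
conf = softmax_confidence_loss(np.array([2.0, 0.5, -1.0]), 0)  # class scores
total = conf + alpha * loc
```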
|
37 | propagation | [ˌprɒpə'ɡeɪʃn] |
- Once this assignment is determined, the loss function and back propagation are applied end-to-end. 一旦确定了这个分配,损失函数和反向传播就可以应用端到端了。
- Since, as pointed out in [12], conv4_3 has a different feature scale compared to the other layers, we use the L2 normalization technique introduced in [12] to scale the feature norm at each location in the feature map to 20 and learn the scale during back propagation. 如[12]所指出的,由于conv4_3与其它层相比具有不同的特征尺度,我们使用[12]中引入的L2正则化技术将特征映射中每个位置的特征范数缩放到20,并在反向传播过程中学习该尺度。
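A numpy sketch of this normalization, assuming a (channels, height, width) feature map; in the real model the per-channel scale is a trained parameter initialized to 20:

```python
import numpy as np

def l2_normalize(feature_map, init_scale=20.0):
    """Rescale the feature norm at each spatial location to a learnable value."""
    norm = np.sqrt((feature_map ** 2).sum(axis=0, keepdims=True)) + 1e-12
    scale = np.full((feature_map.shape[0], 1, 1), init_scale)  # learned in training
    return scale * feature_map / norm
```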
|
38 | mining | [ˈmaɪnɪŋ] |
- Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies. 训练还涉及选择用于检测的默认边界框集合和尺度,以及难例挖掘和数据增强策略。
- Hard negative mining After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. 难例挖掘。在匹配步骤之后,大多数默认边界框为负例,尤其是当可能的默认边界框数量较多时。
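The selection the paper describes keeps all positives plus only the hardest negatives (those with the highest confidence loss), so that negatives outnumber positives by at most 3:1; a sketch with hypothetical helper names:

```python
import numpy as np

def hard_negative_mining(conf_loss, is_positive, neg_pos_ratio=3):
    """Return a mask keeping all positives and the hardest negatives."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, conf_loss)  # mask out positives
    hardest = np.argsort(neg_loss)[::-1][:num_neg]        # highest loss first
    keep = np.zeros_like(is_positive)
    keep[hardest] = True
    return is_positive | keep
```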
|
39 | augmentation | [ˌɔ:ɡmen'teɪʃn] |
- Training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies. 训练还涉及选择用于检测的默认边界框集合和尺度,以及难例挖掘和数据增强策略。
- Data augmentation. 数据增强。
- Data augmentation is crucial. 数据增强至关重要。
- 3.6 Data Augmentation for Small Object Accuracy 3.6 为小目标准确率进行数据增强
- The data augmentation strategy described in Sec. 2.2 helps to improve the performance dramatically, especially on small datasets such as PASCAL VOC. 2.2节中描述的数据增强策略有助于显著提高性能,特别是在PASCAL VOC等小数据集上。
- Because we have more training images by introducing this new “expansion” data augmentation trick, we have to double the training iterations. 因为通过引入这个新的“扩展”数据增强技巧,我们有更多的训练图像,所以我们必须将训练迭代次数加倍。
- In specific, Figure 6 shows that the new augmentation trick significantly improves the performance on small objects. 具体来说,图6显示新的增强技巧显著提高了模型在小目标上的性能。
- This result underscores the importance of the data augmentation strategy for the final model accuracy. 这个结果强调了数据增强策略对最终模型精度的重要性。
- Table 6: Results on multiple datasets when we add the image expansion data augmentation trick. 表6:我们使用图像扩展数据增强技巧在多个数据集上的结果。
- $SSD300^{*}$ and $SSD512^{*}$ are the models that are trained with the new data augmentation. $SSD300^{*}$和$SSD512^{*}$是用新的数据增强训练的模型。
- Fig. 6: Sensitivity and impact of object size with new data augmentation on VOC2007 test set using [21]. 图6:使用[21]在VOC2007 test数据集上,采用新的数据增强后目标尺寸的灵敏度及影响。
- The top row shows the effects of BBox Area per category for the original SSD300 and SSD512 model, and the bottom row corresponds to the $SSD300^{*}$ and $SSD512^{*}$ model trained with the new data augmentation trick. 最上一行显示了原始SSD300和SSD512模型上每个类别的BBox面积的影响,最下面一行对应使用新的数据增强训练技巧的$SSD300^{*}$和$SSD512^{*}$模型。
- It is obvious that the new data augmentation trick helps detecting small objects significantly. 显然,新的数据增强技巧显著地改善了对小目标的检测。
|
40 | jaccard | |
- We begin by matching each ground truth box to the default box with the best jaccard overlap (as in MultiBox [7]). 我们首先将每个实际边界框与具有最佳Jaccard重叠的默认边界框相匹配(如MultiBox[7]中那样)。
- Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). 与MultiBox不同的是,我们将默认边界框匹配到Jaccard重叠高于阈值(0.5)的任何实际边界框。
- Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9. 采样一个图像块,使得与目标之间的最小Jaccard重叠为0.1,0.3,0.5,0.7或0.9。
- The recall is around 85-90\%, and is much higher with “weak” (0.1 jaccard overlap) criteria. 召回约为85-90\%,而“弱”(0.1 Jaccard重叠)标准则要高得多。
- The solid red line reflects the change of recall with strong criteria (0.5 jaccard overlap) as the number of detections increases. 红色的实线表示随着检测次数的增加,强标准(0.5 Jaccard重叠)下的召回变化。
- The dashed red line is using the weak criteria (0.1 jaccard overlap). 红色虚线是使用弱标准(0.1 Jaccard重叠)。
- We then apply nms with jaccard overlap of 0.45 per class and keep the top 200 detections per image. 然后,我们对每个类别应用Jaccard重叠为0.45的nms,并保留每张图像的前200个检测。
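Jaccard overlap is simply intersection-over-union; a small sketch with the 0.5 matching threshold used during training:

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes in (xmin, ymin, xmax, ymax) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# matching rule from the paper: keep any (default box, ground truth) pair
# whose jaccard overlap is higher than 0.5
matched = jaccard([0.1, 0.1, 0.5, 0.5], [0.2, 0.2, 0.6, 0.6]) > 0.5
```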
|
41 | indicator | [ˈɪndɪkeɪtə(r)] |
- Let $x_{ij}^p = \lbrace 1,0 \rbrace$ be an indicator for matching the i-th default box to the j-th ground truth box of category p. 设$x_{ij}^p = \lbrace 1,0 \rbrace$是第i个默认边界框匹配到类别p的第j个实际边界框的指示器。
|
42 | regress | [rɪˈgres] |
- Similar to Faster R-CNN[2], we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). 类似于Faster R-CNN[2],我们回归默认边界框(d)的中心偏移量(cx, cy)和其宽度(w)、高度(h)的偏移量。
- Compared to R-CNN [22], SSD has less localization error, indicating that SSD can localize objects better because it directly learns to regress the object shape and classify object categories instead of using two decoupled steps. 与R-CNN[22]相比,SSD具有更小的定位误差,表明SSD可以更好地定位目标,因为它直接学习回归目标形状和分类目标类别,而不是使用两个解耦步骤。
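A sketch of this offset encoding in center form (cx, cy, w, h), following the parameterization shared with Faster R-CNN:

```python
import numpy as np

def encode_offsets(gt, default):
    """Offsets regressed by SSD for one (ground truth, default box) pair."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return np.array([(g_cx - d_cx) / d_w,   # center x, scaled by box width
                     (g_cy - d_cy) / d_h,   # center y, scaled by box height
                     np.log(g_w / d_w),     # log width ratio
                     np.log(g_h / d_h)])    # log height ratio
```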
|
43 | mimic | [ˈmɪmɪk] |
- However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. 然而,通过利用单个网络中几个不同层的特征映射进行预测,我们可以模拟相同的效果,同时还可以跨所有目标尺度共享参数。
|
44 | semantic | [sɪˈmæntɪk] |
- Previous works [10,11] have shown that using feature maps from the lower layers can improve semantic segmentation quality because the lower layers capture more fine details of the input objects. 以前的工作[10,11]已经表明,使用低层的特征映射可以提高语义分割的质量,因为低层会捕获输入目标的更多细节。
|
45 | exemplar | [ɪgˈzemplɑ:(r)] |
- Figure 1 shows two exemplar feature maps (8 × 8 and 4 × 4) which are used in the framework. 图1显示了框架中使用的两个示例性特征映射(8×8和4×4)。
|
46 | empirical | [ɪmˈpɪrɪkl] |
- Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. 已知网络中不同层的特征映射具有不同的(经验的)感受野大小[13]。
|
47 | receptive | [rɪˈseptɪv] |
- Feature maps from different levels within a network are known to have different (empirical) receptive field sizes [13]. 已知网络中不同层的特征映射具有不同的(经验的)感受野大小[13]。
- Fortunately, within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. 幸运的是,在SSD框架内,默认边界框不需要对应于每层的实际感受野。
- An alternative way of improving SSD is to design a better tiling of default boxes so that its position and scale are better aligned with the receptive field of each position on a feature map. 改进SSD的另一种方法是设计一个更好的默认边界框平铺,使其位置和尺度与特征映射上每个位置的感受野更好地对齐。
|
48 | imbalance | [ɪmˈbæləns] |
- This introduces a significant imbalance between the positive and negative training examples. 这在正的训练实例和负的训练实例之间引入了显著的不平衡。
|
49 | aforementioned | [əˌfɔ:ˈmenʃənd] |
- After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14]. 在上述采样步骤之后,除了应用类似于文献[14]中描述的一些光度变形之外,将每个采样图像块调整到固定尺寸并以0.5的概率进行水平翻转。
|
50 | resize | [ˌri:ˈsaɪz] |
- After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14]. 在上述采样步骤之后,除了应用类似于文献[14]中描述的一些光度变形之外,将每个采样图像块调整到固定尺寸并以0.5的概率进行水平翻转。
|
51 | horizontally | [ˌhɒrɪ'zɒntəlɪ] |
- After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14]. 在上述采样步骤之后,除了应用类似于文献[14]中描述的一些光度变形之外,将每个采样图像块调整到固定尺寸并以0.5的概率进行水平翻转。
|
52 | flip | [flɪp] |
- After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14]. 在上述采样步骤之后,除了应用类似于文献[14]中描述的一些光度变形之外,将每个采样图像块调整到固定尺寸并以0.5的概率进行水平翻转。
- Fast and Faster R-CNN use the original image and the horizontal flip to train. Fast和Faster R-CNN使用原始图像和水平翻转来训练。
|
53 | photo-metric | [!≈ ˈfəʊtəʊ ˈmetrɪk] |
- After the aforementioned sampling step, each sampled patch is resized to fixed size and is horizontally flipped with probability of 0.5, in addition to applying some photo-metric distortions similar to those described in [14]. 在上述采样步骤之后,除了应用类似于文献[14]中描述的一些光度变形之外,将每个采样图像块调整到固定尺寸并以0.5的概率进行水平翻转。
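A dependency-free sketch of the resize-and-flip part of this pipeline (photo-metric distortions and the matching transform of the ground truth boxes are omitted; the helper name is our own):

```python
import random
import numpy as np

def resize_and_flip(patch, out_size=300):
    """Resize a sampled patch to a fixed size and flip it horizontally
    with probability 0.5."""
    h, w = patch.shape[:2]
    rows = np.arange(out_size) * h // out_size  # nearest-neighbour resize
    cols = np.arange(out_size) * w // out_size
    resized = patch[rows][:, cols]
    if random.random() < 0.5:
        resized = resized[:, ::-1]  # horizontal flip (boxes must be flipped too)
    return resized
```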
|
54 | VGG16 | |
- Base network Our experiments are all based on VGG16[15], which is pre-trained on the ILSVRC CLS-LOC dataset[16]. 基础网络。我们的实验全部基于VGG16[15],它是在ILSVRC CLS-LOC数据集[16]上预先训练的。
|
55 | CLS-LOC | |
- Base network Our experiments are all based on VGG16[15], which is pre-trained on the ILSVRC CLS-LOC dataset[16]. 基础网络。我们的实验全部基于VGG16[15],它是在ILSVRC CLS-LOC数据集[16]上预先训练的。
|
56 | DeepLab-LargeFOV | |
- Similar to DeepLab-LargeFOV[17], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from $2\times 2$-s2 to $3\times 3$-s1, and use the atrous algorithm[18] to fill the “holes”. 类似于DeepLab-LargeFOV[17],我们将fc6和fc7转换为卷积层,对fc6和fc7的参数进行子采样,将pool5从$2\times 2$-s2更改为$3\times 3$-s1,并使用空洞算法[18]来填充“空洞”。
- As described in Sec. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [17]. 如第3节所述,我们根据DeepLab-LargeFOV[17]使用子采样的VGG16的空洞版本。
|
57 | subsample | ['sʌbsɑ:mpl] |
- Similar to DeepLab-LargeFOV[17], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from $2\times 2$-s2 to $3\times 3$-s1, and use the atrous algorithm[18] to fill the “holes”. 类似于DeepLab-LargeFOV[17],我们将fc6和fc7转换为卷积层,对fc6和fc7的参数进行子采样,将pool5从$2\times 2$-s2更改为$3\times 3$-s1,并使用空洞算法[18]来填充“空洞”。
|
58 | atrous | [ˈeɪtrəs] |
- Similar to DeepLab-LargeFOV[17], we convert fc6 and fc7 to convolutional layers, subsample parameters from fc6 and fc7, change pool5 from $2\times 2$-s2 to $3\times 3$-s1, and use the atrous algorithm[18] to fill the “holes”. 类似于DeepLab-LargeFOV[17],我们将fc6和fc7转换为卷积层,对fc6和fc7的参数进行子采样,将pool5从$2\times 2$-s2更改为$3\times 3$-s1,并使用空洞算法[18]来填充“空洞”。
- Atrous is faster. Atrous更快。
- As described in Sec. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [17]. 如第3节所述,我们根据DeepLab-LargeFOV[17]使用子采样的VGG16的空洞版本。
|
59 | SGD | [!≈ es dʒiː diː] |
- We fine-tune the resulting model using SGD with initial learning rate $10^{-3}$, 0.9 momentum, 0.0005 weight decay, and batch size 32. 我们使用SGD对得到的模型进行微调,初始学习率为$10^{-3}$,动量为0.9,权重衰减为0.0005,批数据大小为32。
|
60 | momentum | [məˈmentəm] |
- We fine-tune the resulting model using SGD with initial learning rate $10^{-3}$, 0.9 momentum, 0.0005 weight decay, and batch size 32. 我们使用SGD对得到的模型进行微调,初始学习率为$10^{-3}$,动量为0.9,权重衰减为0.0005,批数据大小为32。
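The same hyper-parameters expressed with a common framework optimizer (PyTorch here purely for illustration; the released training code is in Caffe):

```python
import torch

model = torch.nn.Conv2d(512, 84, kernel_size=3, padding=1)  # stand-in module
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3,          # initial learning rate 10^-3
                            momentum=0.9,
                            weight_decay=5e-4)
# the batch size of 32 is configured in the data loader, not the optimizer
```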
|
61 | Caffe | |
- The full training and testing code is built on Caffe[19] and is open source at: https://github.com/weiliu89/caffe/tree/ssd. 完整的训练和测试代码建立在Caffe[19]上并开源:https://github.com/weiliu89/caffe/tree/ssd。
|
62 | VGG16 | |
- All methods fine-tune on the same pre-trained VGG16 network. 所有的方法都在相同的预训练好的VGG16网络上进行微调。
- As described in Sec. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [17]. 如第3节所述,我们根据DeepLab-LargeFOV[17]使用子采样的VGG16的空洞版本。
- If we use the full VGG16, keeping pool5 with 2×2-s2 and not subsampling parameters from fc6 and fc7, and add conv5_3 for prediction, the result is about the same while the speed is about $20\%$ slower. 如果我们使用完整的VGG16,保持pool5为2×2-s2,不对fc6和fc7的参数进行子采样,并添加conv5_3进行预测,结果大致相同,而速度慢了大约20%。
- Note that about $80\%$ of the forward time is spent on the base network (VGG16 in our case). 请注意,大约80%前馈时间花费在基础网络上(本例中为VGG16)。
|
63 | SSD300 | |
- Figure 2 shows the architecture details of the SSD300 model. 图2显示了SSD300模型的架构细节。
- When training on VOC2007 $\texttt{trainval}$, Table 1 shows that our low resolution SSD300 model is already more accurate than Fast R-CNN. 当对VOC2007 $\texttt{trainval}$进行训练时,表1显示了我们的低分辨率SSD300模型已经比Fast R-CNN更准确。
- If we train SSD with more (i.e. 07+12) data, we see that SSD300 is already better than Faster R-CNN by 1.1\% and that SSD512 is $3.6\%$ better. 如果我们用更多的(即07+12)数据来训练SSD,我们看到SSD300已经比Faster R-CNN好$1.1\%$,SSD512比Faster R-CNN好$3.6\%$。
- Table 4 shows the results of our SSD300 and SSD512 model. 表4显示了我们的SSD300和SSD512模型的结果。
- Our SSD300 improves accuracy over Fast/Faster R-CNN. 我们的SSD300比Fast/Faster R-CNN提高了准确性。
- To further validate the SSD framework, we trained our SSD300 and SSD512 architectures on the COCO dataset. 为了进一步验证SSD框架,我们在COCO数据集上对SSD300和SSD512架构进行了训练。
- Similar to what we observed on the PASCAL VOC dataset, SSD300 is better than Fast R-CNN in both mAP@0.5 and mAP@[0.5:0.95]. 与我们在PASCAL VOC数据集中观察到的结果类似,SSD300在mAP@0.5和mAP@[0.5:0.95]中都优于Fast R-CNN。
- SSD300 has a similar mAP@0.75 as ION [24] and Faster R-CNN [25], but is worse in mAP@0.5. SSD300与ION[24]和Faster R-CNN[25]具有相似的mAP@0.75,但是mAP@0.5更差。
- We train a SSD300 model using the ILSVRC2014 DET train and val1 as used in [22]. 我们使用[22]中所用的ILSVRC2014 DET train和val1来训练SSD300模型。
- The top row shows the effects of BBox Area per category for the original SSD300 and SSD512 model, and the bottom row corresponds to the $SSD300^{*}$ and $SSD512^{*}$ model trained with the new data augmentation trick. 最上一行显示了原始SSD300和SSD512模型上每个类别的BBox面积的影响,最下面一行对应使用新的数据增强训练技巧的$SSD300^{*}$和$SSD512^{*}$模型。
- This step costs about 1.7 msec per image for SSD300 and 20 VOC classes, which is close to the total time (2.4 msec) spent on all newly added layers. 对于SSD300和20个VOC类别,这个步骤每张图像花费大约1.7毫秒,接近在所有新增层上花费的总时间(2.4毫秒)。
- Both our SSD300 and SSD512 method outperforms Faster R-CNN in both speed and accuracy. 我们的SSD300和SSD512的速度和精度均优于Faster R-CNN。
- To the best of our knowledge, SSD300 is the first real-time method to achieve above $70\%$ mAP. 就我们所知,SSD300是第一个实现70%以上mAP的实时方法。
- SSD300 is the only real-time detection method that can achieve above 70\% mAP. SSD300是唯一能达到70%以上mAP的实时检测方法。
- Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO [5] alternative, while producing markedly superior detection accuracy. 我们的实时SSD300模型运行速度为59FPS,比目前的实时YOLO[5]更快,同时显著提高了检测精度。
|
64 | xavier | ['zʌvɪə] |
- We initialize the parameters for all the newly added convolutional layers with the “xavier” method [20]. 我们使用“xavier”方法[20]初始化所有新添加的卷积层的参数。
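A sketch of the “xavier” (Glorot) scheme [20] in its common uniform variant; the exact fan convention varies between frameworks, so treat the formula below as one standard choice:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Uniform weights with variance scaled to the layer fan-in/fan-out."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

w = xavier_uniform(fan_in=512 * 3 * 3, fan_out=84)  # e.g. one 3x3 conv layer
```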
|
65 | surpass | [səˈpɑ:s] |
- When we train SSD on a larger $512\times 512$ input image, it is even more accurate, surpassing Faster R-CNN by $1.7\%$ mAP. 当我们用更大的$512\times 512$输入图像上训练SSD时,它更加准确,超过了Faster R-CNN $1.7\%$的mAP。
|
66 | i.e. | [ˌaɪ ˈi:] |
- If we train SSD with more (i.e. 07+12) data, we see that SSD300 is already better than Faster R-CNN by 1.1\% and that SSD512 is $3.6\%$ better. 如果我们用更多的(即07+12)数据来训练SSD,我们看到SSD300已经比Faster R-CNN好$1.1\%$,SSD512比Faster R-CNN好$3.6\%$。
|
67 | SSD512 | |
- If we train SSD with more (i.e. 07+12) data, we see that SSD300 is already better than Faster R-CNN by 1.1\% and that SSD512 is $3.6\%$ better. 如果我们用更多的(即07+12)数据来训练SSD,我们看到SSD300已经比Faster R-CNN好$1.1\%$,SSD512比Faster R-CNN好$3.6\%$。
- If we take models trained on COCO $\texttt{trainval35k}$ as described in Sec. 3.4 and fine-tuning them on the 07+12 dataset with SSD512, we achieve the best results: 81.6\% mAP. 如果我们将SSD512用3.4节描述的COCO $\texttt{trainval35k}$来训练模型并在07+12数据集上进行微调,我们获得了最好的结果:$81.6\%$的mAP。
- Fig. 3: Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test. 图3:SSD512在VOC2007 test中的动物,车辆和家具上的性能可视化。
- Table 4 shows the results of our SSD300 and SSD512 model. 表4显示了我们的SSD300和SSD512模型的结果。
- When fine-tuned from models trained on COCO, our SSD512 achieves $80.0\%$ mAP, which is $4.1\%$ higher than Faster R-CNN. 当对从COCO上训练的模型进行微调后,我们的SSD512达到了80.0%的mAP,比Faster R-CNN高了4.1%。
- To further validate the SSD framework, we trained our SSD300 and SSD512 architectures on the COCO dataset. 为了进一步验证SSD框架,我们在COCO数据集上对SSD300和SSD512架构进行了训练。
- By increasing the image size to 512 × 512, our SSD512 is better than Faster R-CNN [25] in both criteria. 通过将图像尺寸增加到512×512,我们的SSD512在这两个标准中都优于Faster R-CNN[25]。
- Interestingly, we observe that SSD512 is 5.3\% better in mAP@0.75, but is only $1.2\%$ better in mAP@0.5. 有趣的是,我们观察到SSD512在mAP@0.75上要好5.3%,但在mAP@0.5上仅好1.2%。
- In Fig. 5, we show some detection examples on COCO test-dev with the SSD512 model. 在图5中,我们展示了SSD512模型在COCO test-dev上的一些检测实例。
- Fig. 5: Detection examples on COCO test-dev with SSD512 model. 图5:SSD512模型在COCO test-dev上的检测实例。
- The top row shows the effects of BBox Area per category for the original SSD300 and SSD512 model, and the bottom row corresponds to the $SSD300^{*}$ and $SSD512^{*}$ model trained with the new data augmentation trick. 最上一行显示了原始SSD300和SSD512模型上每个类别的BBox面积的影响,最下面一行对应使用新的数据增强训练技巧的$SSD300^{*}$和$SSD512^{*}$模型。
- Both our SSD300 and SSD512 method outperforms Faster R-CNN in both speed and accuracy. 我们的SSD300和SSD512的速度和精度均优于Faster R-CNN。
- Therefore, using a faster base network could even further improve the speed, which can possibly make the SSD512 model real-time as well. 因此,使用更快的基础网络可以进一步提高速度,这也可能使SSD512模型达到实时。
- By using a larger input image, SSD512 outperforms all methods on accuracy while maintaining a close to real-time speed. 通过使用更大的输入图像,SSD512在精度上超过了所有方法同时保持近似实时的速度。
- Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. 在PASCAL VOC和COCO上,我们的SSD512模型的性能明显优于最先进的Faster R-CNN[2],而速度提高了3倍。
|
68 | trainval | |
- Data: “07”: VOC2007 trainval, “07+12”: union of VOC2007 and VOC2012 trainval. “07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12. 数据:“07”:VOC2007 trainval;“07+12”:VOC2007和VOC2012 trainval的并集;“07+12+COCO”:先在COCO trainval35k上训练,然后在07+12上微调。
- We use the same settings as those used for our basic VOC2007 experiments above, except that we use VOC2012 trainval and VOC2007 trainval and test (21503 images) for training, and test on VOC2012 test (10991 images). 除了我们使用VOC2012 trainval和VOC2007 trainval,test(21503张图像)进行训练,以及在VOC2012 test(10991张图像)上进行测试之外,我们使用与上述基本的VOC2007实验相同的设置。
- Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is 448 × 448. data: “07++12”: union of VOC2007 trainval and test and VOC2012 trainval. Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。数据:“07++12”:VOC2007 trainval、test与VOC2012 trainval的并集。
|
69 | VOC2012 | |
- Data: “07”: VOC2007 trainval, “07+12”: union of VOC2007 and VOC2012 trainval. “07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12. 数据:“07”:VOC2007 trainval;“07+12”:VOC2007和VOC2012 trainval的并集;“07+12+COCO”:先在COCO trainval35k上训练,然后在07+12上微调。
- 3.3 PASCAL VOC2012 3.3 PASCAL VOC2012
- We use the same settings as those used for our basic VOC2007 experiments above, except that we use VOC2012 trainval and VOC2007 trainval and test (21503 images) for training, and test on VOC2012 test (10991 images). 除了我们使用VOC2012 trainval和VOC2007 trainval,test(21503张图像)进行训练,以及在VOC2012 test(10991张图像)上进行测试之外,我们使用与上述基本的VOC2007实验相同的设置。
- Table 4: PASCAL VOC2012 test detection results. 表4:PASCAL VOC2012 test检测结果。Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。
- Fast and Faster R-CNN use images with minimum dimension 600, while the image size for YOLO is 448 × 448. data: “07++12”: union of VOC2007 trainval and test and VOC2012 trainval. Fast和Faster R-CNN使用最小维度为600的图像,而YOLO的图像尺寸为448×448。数据:“07++12”:VOC2007 trainval、test与VOC2012 trainval的并集。
|
70 | trainval35k | |
- Data: “07”: VOC2007 trainval, “07+12”: union of VOC2007 and VOC2012 trainval. “07+12+COCO”: first train on COCO trainval35k then fine-tune on 07+12. 数据:“07”:VOC2007 trainval;“07+12”:VOC2007和VOC2012 trainval的并集;“07+12+COCO”:先在COCO trainval35k上训练,然后在07+12上微调。
- “07++12+COCO”: first train on COCO trainval35k then fine-tune on 07++12. “07++12+COCO”:先在COCO trainval35k上训练,然后在07++12上微调。
- We use the trainval35k[24] for training. 我们使用trainval35k[24]进行训练。
|
71 | decouple | [di:ˈkʌpl] |
- Compared to R-CNN [22], SSD has less localization error, indicating that SSD can localize objects better because it directly learns to regress the object shape and classify object categories instead of using two decoupled steps. 与R-CNN[22]相比,SSD具有更小的定位误差,表明SSD可以更好地定位目标,因为它直接学习回归目标形状和分类目标类别,而不是使用两个解耦步骤。
|
72 | Visualization | [ˌvɪʒʊəlaɪ'zeɪʃn] |
- Fig. 3: Visualization of performance for SSD512 on animals, vehicles, and furniture from VOC2007 test. 图3:SSD512在VOC2007 test中的动物,车辆和家具上的性能可视化。
|
73 | cumulative | [ˈkju:mjələtɪv] |
- The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). 第一行显示由于定位不佳(Loc),与相似类别(Sim)混淆,与其它(Oth)或背景(BG)相关的正确检测(Cor)或假阳性的累积分数。
|
74 | Cor | [kɔ:(r)] |
- The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). 第一行显示由于定位不佳(Loc),与相似类别(Sim)混淆,与其它(Oth)或背景(BG)相关的正确检测(Cor)或假阳性的累积分数。
|
75 | Sim | [sɪm] |
- The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). 第一行显示由于定位不佳(Loc),与相似类别(Sim)混淆,与其它(Oth)或背景(BG)相关的正确检测(Cor)或假阳性的累积分数。
|
76 | Oth | |
- The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). 第一行显示由于定位不佳(Loc),与相似类别(Sim)混淆,与其它(Oth)或背景(BG)相关的正确检测(Cor)或假阳性的累积分数。
|
77 | BG | [!≈ bi: dʒi:] |
- The top row shows the cumulative fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar categories (Sim), with others (Oth), or with background (BG). 第一行显示由于定位不佳(Loc),与相似类别(Sim)混淆,与其它(Oth)或背景(BG)相关的正确检测(Cor)或假阳性的累积分数。
|
78 | dash | [dæʃ] |
- The dashed red line is using the weak criteria (0.1 jaccard overlap). 红色虚线是使用弱标准(0.1 Jaccard重叠)。
|
79 | top-ranked | ['tɒpr'æŋkt] |
- The bottom row shows the distribution of top-ranked false positive types. 最下面一行显示了排名靠前的假阳性类型的分布。
|
80 | Sensitivity | [ˌsensəˈtɪvəti] |
- Fig. 4: Sensitivity and impact of different object characteristics on VOC2007 test set using [21]. 图4:使用[21]在VOC2007 test数据集上不同目标特性的灵敏度和影响。
- Fig. 6: Sensitivity and impact of object size with new data augmentation on VOC2007 test set using [21]. 图6:使用[21]在VOC2007 test数据集上,采用新的数据增强后目标尺寸的灵敏度及影响。
|
81 | XW | [!≈ eks 'dʌblju:] |
- Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium; W=wide; XW=extra-wide. 长宽比:XT=超高/窄;T=高;M=中等;W=宽;XW=超宽。
|
82 | extra-wide | [!≈ ˈekstrə waɪd] |
- Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium; W=wide; XW=extra-wide. 长宽比:XT=超高/窄;T=高;M=中等;W=宽;XW=超宽。
|
83 | subsampled | |
- As described in Sec. 3, we used the atrous version of a subsampled VGG16, following DeepLab-LargeFOV [17]. 如第3节所述,我们根据DeepLab-LargeFOV[17]使用子采样的VGG16的空洞版本。
|
84 | subsampling | |
- If we use the full VGG16, keeping pool5 with 2×2-s2 and not subsampling parameters from fc6 and fc7, and add conv5_3 for prediction, the result is about the same while the speed is about $20\%$ slower. 如果我们使用完整的VGG16,保持pool5为2×2-s2,不对fc6和fc7的参数进行子采样,并添加conv5_3进行预测,结果大致相同,而速度慢了大约20%。
|
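A minimal PyTorch sketch of the atrous (dilated) substitution mentioned above, in which fc6 is recast as a 3×3 convolution with dilation 6 so it keeps a large receptive field without a striding pool5. The channel counts and the 19×19 map size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# fc6 recast as a dilated (atrous) 3x3 convolution: dilation=6 with padding=6
# preserves the spatial size while enlarging the receptive field; fc7 becomes
# a 1x1 convolution. Channel counts here are illustrative.
fc6_atrous = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
fc7 = nn.Conv2d(1024, 1024, kernel_size=1)

x = torch.randn(1, 512, 19, 19)  # feature map after a non-striding pool5
print(fc7(fc6_atrous(x)).shape)  # torch.Size([1, 1024, 19, 19])
```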
85 | exhaustively | [ɪɡ'zɔ:stɪvlɪ] |
- We do not exhaustively optimize the tiling for each setting. 我们没有详尽地优化每个设置的平铺。
|
86 | monotonically | [mɒnə'tɒnɪklɪ] |
- Table 3 shows a decrease in accuracy with fewer layers, dropping monotonically from 74.3 to 62.4. 表3显示层数较少,精度降低,从74.3单调递减至62.4。
|
87 | prune | [pru:n] |
- The reason might be that we do not have enough large boxes to cover large objects after the pruning. 原因可能是修剪后我们没有足够大的边界框来覆盖大的目标。
- When we use primarily finer resolution maps, the performance starts increasing again because even after pruning a sufficient number of large boxes remains. 当我们主要使用更高分辨率的特征映射时,性能开始再次上升,因为即使在修剪之后仍然有足够数量的大边界框。
|
88 | ROI | [rwɑ:] |
- Besides, since our predictions do not rely on ROI pooling as in [6], we do not have the collapsing bins problem in low-resolution feature maps [23]. 此外,由于我们的预测不像[6]那样依赖于ROI池化,所以我们在低分辨率特征映射中没有折叠组块的问题[23]。
|
89 | validate | [ˈvælɪdeɪt] |
- To further validate the SSD framework, we trained our SSD300 and SSD512 architectures on the COCO dataset. 为了进一步验证SSD框架,我们在COCO数据集上对SSD300和SSD512架构进行了训练。
- Again, it validates that SSD is a general framework for high quality real-time detection. 再一次证明了SSD是用于高质量实时检测的通用框架。
- We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. 我们通过实验验证,在给定合适训练策略的情况下,大量仔细选择的默认边界框会提高性能。
|
90 | test-dev | [!≈ test dev] |
- Table 5 shows the results on test-dev2015. 表5显示了test-dev2015的结果。
- In Fig. 5, we show some detection examples on COCO test-dev with the SSD512 model. 在图5中,我们展示了SSD512模型在COCO test-dev上的一些检测实例。
- Table 5: COCO test-dev2015 detection results. 表5:COCO test-dev2015检测结果。
- Fig. 5: Detection examples on COCO test-dev with SSD512 model. 图5:SSD512模型在COCO test-dev上的检测实例。
|
91 | ION | [ˈaɪən] |
- SSD300 has a similar mAP@0.75 as ION [24] and Faster R-CNN [25], but is worse in mAP@0.5. SSD300与ION[24]和Faster R-CNN[25]具有相似的mAP@0.75,但在mAP@0.5上更差。
- Compared to ION, the improvement in AR for large and small objects is more similar ($5.4\%$ vs. $3.9\%$). 与ION相比,大型和小型目标的AR改进更为相似(5.4%和3.9%)。
|
92 | conjecture | [kənˈdʒektʃə(r)] |
- We conjecture that Faster R-CNN is more competitive on smaller objects with SSD because it performs two box refinement steps, in both the RPN part and in the Fast R-CNN part. 我们推测Faster R-CNN在较小的目标上比SSD更具竞争力,因为它在RPN部分和Fast R-CNN部分都执行了两个边界框细化步骤。
|
93 | refinement | [rɪˈfaɪnmənt] |
- We conjecture that Faster R-CNN is more competitive on smaller objects with SSD because it performs two box refinement steps, in both the RPN part and in the Fast R-CNN part. 我们推测Faster R-CNN在较小的目标上比SSD更具竞争力,因为它在RPN部分和Fast R-CNN部分都执行了两个边界框细化步骤。
|
94 | RPN | [!≈ ɑ:(r) pi: en] |
- We conjecture that Faster R-CNN is more competitive on smaller objects with SSD because it performs two box refinement steps, in both the RPN part and in the Fast R-CNN part. 我们推测Faster R-CNN在较小的目标上比SSD更具竞争力,因为它在RPN部分和Fast R-CNN部分都执行了两个边界框细化步骤。
- Faster R-CNN [2] replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between fine-tuning shared convolutional layers and prediction layers for these two networks. Faster R-CNN[2]将选择性搜索提出替换为区域提出网络(RPN)学习到的区域提出,并引入了一种方法,通过交替两个网络之间的微调共享卷积层和预测层将RPN和Fast R-CNN结合在一起。
- Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. 我们的SSD与Faster R-CNN中的区域提出网络(RPN)非常相似,因为我们也使用一组固定的(默认)边界框进行预测,类似于RPN中的锚边界框。
- Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks. 因此,我们的方法避免了将RPN与Fast R-CNN合并的复杂性,并且更容易训练,更快且更直接地集成到其它任务中。
|
95 | Preliminary | [prɪˈlɪmɪnəri] |
- 3.5 Preliminary ILSVRC results 3.5 初步的ILSVRC结果
|
96 | DET | [!≈ di: i: ti:] |
- We applied the same network architecture we used for COCO to the ILSVRC DET dataset [16]. 我们将在COCO上应用的相同网络架构应用于ILSVRC DET数据集[16]。
- We train a SSD300 model using the ILSVRC2014 DET train and val1 as used in [22]. 我们使用[22]中使用的ILSVRC2014 DET train和val1来训练SSD300模型。
|
97 | ILSVRC2014 | |
- We train a SSD300 model using the ILSVRC2014 DET train and val1 as used in [22]. 我们使用[22]中使用的ILSVRC2014 DET train和val1来训练SSD300模型。
|
98 | follow-up | ['fɒləʊ ʌp] |
- Without a follow-up feature resampling step as in Faster R-CNN, the classification task for small objects is relatively hard for SSD, as demonstrated in our analysis (see Fig. 4). SSD没有如Faster R-CNN中后续的特征重采样步骤,小目标的分类任务对SSD来说相对困难,正如我们的分析(见图4)所示。
|
99 | zoom | [zu:m] |
- The random crops generated by the strategy can be thought of as a “zoom in” operation and can generate many larger training examples. 策略产生的随机裁剪可以被认为是“放大”操作,并且可以产生许多更大的训练样本。
- To implement a “zoom out” operation that creates more small training examples, we first randomly place an image on a canvas of 16× of the original image size filled with mean values before we do any random crop operation. 为了实现创建更多小型训练样本的“缩小”操作,我们首先将图像随机放置在用均值填充的、大小为原始图像16倍的画布上,然后再进行任意的随机裁剪操作。 (A sketch of this operation follows this entry.)
|
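A minimal NumPy sketch of the “zoom out” placement described above, assuming the 16× figure refers to area (up to 4× per side) and that the canvas is filled with the per-channel mean; the usual random-crop sampling then proceeds on the result:

```python
import numpy as np

def zoom_out(image, mean, max_ratio=4):
    """Place the image at a random spot on a mean-filled canvas of up to
    max_ratio x the original size per side (16x the area for max_ratio=4),
    making objects relatively smaller."""
    h, w, c = image.shape
    ratio = np.random.uniform(1, max_ratio)
    canvas = np.full((int(h * ratio), int(w * ratio), c), mean, dtype=image.dtype)
    top = np.random.randint(0, canvas.shape[0] - h + 1)
    left = np.random.randint(0, canvas.shape[1] - w + 1)
    canvas[top:top + h, left:left + w] = image
    return canvas

img = np.zeros((300, 300, 3), dtype=np.float32)
print(zoom_out(img, mean=(104, 117, 123)).shape)  # e.g. (1043, 1043, 3); varies
```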
100 | underscore | [ˌʌndəˈskɔ:(r)] |
- This result underscores the importance of the data augmentation strategy for the final model accuracy. 这个结果强调了数据增强策略对最终模型精度的重要性。
|
101 | align | [əˈlaɪn] |
- An alternative way of improving SSD is to design a better tiling of default boxes so that its position and scale are better aligned with the receptive field of each position on a feature map. 改进SSD的另一种方法是设计一个更好的默认边界框平铺,使其位置和尺度与特征映射上每个位置的感受野更好地对齐。 (A tiling sketch follows this entry.)
|
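A minimal sketch of one such tiling rule, using the linear scale schedule and aspect-ratio set reported in the SSD paper (s_min = 0.2, s_max = 0.9, ratios {1, 2, 3, 1/2, 1/3}); the exact values are tunable design choices rather than anything fixed by this list:

```python
import math

def default_box_shapes(num_maps=6, s_min=0.2, s_max=0.9,
                       aspect_ratios=(1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)):
    """(width, height) of default boxes per feature map, following
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)."""
    shapes = []
    for k in range(1, num_maps + 1):
        s_k = s_min + (s_max - s_min) * (k - 1) / (num_maps - 1)
        boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
        # Extra square box at scale sqrt(s_k * s_{k+1}) for aspect ratio 1.
        s_next = s_min + (s_max - s_min) * k / (num_maps - 1)
        boxes.append((math.sqrt(s_k * s_next),) * 2)
        shapes.append(boxes)
    return shapes

for k, boxes in enumerate(default_box_shapes(), start=1):
    print(f"map {k}: {len(boxes)} box shapes, first = {boxes[0]}")
```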
102 | nms | |
- Considering the large number of boxes generated from our method, it is essential to perform non-maximum suppression (nms) efficiently during inference. 考虑到我们的方法产生大量边界框,在推断期间执行非最大值抑制(nms)是必要的。
- We then apply nms with jaccard overlap of 0.45 per class and keep the top 200 detections per image. 然后,我们应用nms,每个类别0.45的Jaccard重叠,并保留每张图像的前200个检测。 (An nms sketch follows this entry.)
|
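A minimal NumPy sketch of the greedy per-class nms quoted above, with the 0.45 overlap threshold; this is a generic implementation, not the authors' exact code, and the 200-detection cap is applied per class here for simplicity (the entry describes a per-image cap):

```python
import numpy as np

def nms(boxes, scores, overlap_thresh=0.45, top_k=200):
    """Greedy non-maximum suppression for a single class.
    boxes: (N, 4) array of (xmin, ymin, xmax, ymax); scores: (N,)."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1][:top_k]  # best-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Jaccard overlap of the kept box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= overlap_thresh]  # suppress heavy overlaps
    return keep  # indices of detections to keep; run once per class
```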
103 | msec | [m'zek] |
- This step costs about 1.7 msec per image for SSD300 and 20 VOC classes, which is close to the total time (2.4 msec) spent on all newly added layers. 对于SSD300和20个VOC类别,这个步骤每张图像花费大约1.7毫秒,接近在所有新增层上花费的总时间(2.4毫秒)。
|
104 | cuDNN | |
- We measure the speed with batch size 8 using Titan X and cuDNN v4 with Intel Xeon E5-2667v3@3.20GHz. 我们使用Titan X、cuDNN v4、Intel Xeon E5-2667v3@3.20GHz以及批大小为8来测量速度。
|
105 | Xeon | |
- We measure the speed with batch size 8 using Titan X and cuDNN v4 with Intel Xeon E5-2667v3@3.20GHz. 我们使用Titan X、cuDNN v4、Intel Xeon E5-2667v3@3.20GHz以及批大小为8来测量速度。
|
106 | advent | [ˈædvent] |
- Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) [26] and Selective Search [1] – had comparable performance. 在卷积神经网络出现之前,这两种方法的最新技术——可变形部件模型(DPM)[26]和选择性搜索[1]——具有相当的性能。
|
107 | Deformable | [dɪ'fɔ:məbl] |
- Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) [26] and Selective Search [1] – had comparable performance. 在卷积神经网络出现之前,这两种方法的最新技术——可变形部件模型(DPM)[26]和选择性搜索[1]——具有相当的性能。
|
108 | DPM | [!≈ di: pi: em] |
- Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) [26] and Selective Search [1] – had comparable performance. 在卷积神经网络出现之前,这两种方法的最新技术——可变形部件模型(DPM)[26]和选择性搜索[1]——具有相当的性能。
|
109 | prevalent | [ˈprevələnt] |
- However, after the dramatic improvement brought on by R-CNN [22], which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent. 然而,在R-CNN[22]结合选择性搜索区域提出和基于后分类的卷积网络带来的显著改进后,区域提出目标检测方法变得流行。
|
110 | time-consuming | [taɪm kən'sju:mɪŋ] |
- The first set of approaches improve the quality and speed of post-classification, since it requires the classification of thousands of image crops, which is expensive and time-consuming. 第一套方法提高了后分类的质量和速度,因为它需要对成千上万的裁剪图像进行分类,这是昂贵和耗时的。
|
111 | SPPnet | |
- SPPnet [9] speeds up the original R-CNN approach significantly. SPPnet[9]显著加快了原有的R-CNN方法。
- Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness. Fast R-CNN[6]扩展了SPPnet,使得它可以通过最小化置信度和边界框回归的损失来对所有层进行端到端的微调,最初在MultiBox[7]中引入用于学习目标。
|
112 | objectness | [!≈ ˈɒbdʒɪktnəs] |
- Fast R-CNN [6] extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox [7] for learning objectness. Fast R-CNN[6]扩展了SPPnet,使得它可以通过最小化置信度和边界框回归的损失来对所有层进行端到端的微调,最初在MultiBox[7]中引入用于学习目标。
|
113 | setup | ['setʌp] |
- This further improves the detection accuracy but results in a somewhat complex setup, requiring the training of two neural networks with a dependency between them. 这进一步提高了检测精度,但是导致了一些复杂的设置,需要训练两个具有依赖关系的神经网络。
|
114 | complication | [ˌkɒmplɪˈkeɪʃn] |
- Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks. 因此,我们的方法避免了将RPN与Fast R-CNN合并的复杂性,并且更容易训练,更快且更直接地集成到其它任务中。
|
115 | topmost | [ˈtɒpməʊst] |
- OverFeat [4], a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. OverFeat[4]是滑动窗口方法的深度版本,在知道了底层目标类别的置信度之后,直接从最顶层的特征映射的每个位置预测边界框。
- YOLO [5] uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). YOLO[5]使用整个最顶层的特征映射来预测多个类别和边界框(这些类别共享)的置信度。
- If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat [4]; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO [5]. 如果我们只从最顶层的特征映射的每个位置使用一个默认框,我们的SSD将具有与OverFeat[4]相似的架构;如果我们使用整个最顶层的特征映射,并添加一个全连接层进行预测来代替我们的卷积预测器,并且没有明确地考虑多个长宽比,我们可以近似地再现YOLO[5]。
|
116 | experimentally | [ɪkˌsperɪ'mentəlɪ] |
- We experimentally validate that given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. 我们通过实验验证,在给定合适训练策略的情况下,大量仔细选择的默认边界框会提高性能。
|
117 | favorably | ['feɪvərəblɪ] |
- We demonstrate that given the same VGG-16 base architecture, SSD compares favorably to its state-of-the-art object detector counterparts in terms of both accuracy and speed. 我们证明了给定相同的VGG-16基础架构,SSD在准确性和速度方面与其对应的最先进的目标检测器相比毫不逊色。
|
118 | standalone | ['stændəˌləʊn] |
- Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. 除了单独使用之外,我们相信我们的整体和相对简单的SSD模型为采用目标检测组件的大型系统提供了有用的构建模块。
|
119 | monolithic | [ˌmɒnə'lɪθɪk] |
- Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. 除了单独使用之外,我们相信我们的整体和相对简单的SSD模型为采用目标检测组件的大型系统提供了有用的构建模块。
|
120 | recurrent | [rɪˈkʌrənt] |
- A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video simultaneously. 一个有前景的未来方向是探索它作为系统的一部分,使用循环神经网络来同时检测和跟踪视频中的目标。
|
121 | Acknowledgment | [ək'nɒlɪdʒmənt] |
|
122 | internship | [ˈɪntɜ:nʃɪp] |
- This work was started as an internship project at Google and continued at UNC. 这项工作始于谷歌的一个实习项目,并在UNC继续进行。
|
123 | UNC | [ʌŋk] |
- This work was started as an internship project at Google and continued at UNC. 这项工作是在谷歌的一个实习项目开始的,并在UNC继续。
|
124 | Alex | ['ælɪks] |
- We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. 我们要感谢Alex Toshev进行有益的讨论,并感谢Google的Image Understanding和DistBelief团队。
|
125 | Toshev | |
- We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. 我们要感谢Alex Toshev进行有益的讨论,并感谢Google的Image Understanding和DistBelief团队。
|
126 | indebted | [ɪnˈdetɪd] |
- We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. 我们要感谢Alex Toshev进行有益的讨论,并感谢Google的Image Understanding和DistBelief团队。
|
127 | DistBelief | |
- We would like to thank Alex Toshev for helpful discussions and are indebted to the Image Understanding and DistBelief teams at Google. 我们要感谢Alex Toshev进行有益的讨论,并感谢Google的Image Understanding和DistBelief团队。
|
128 | Ammirato | |
- We also thank Philip Ammirato and Patrick Poirson for helpful comments. 我们也感谢Philip Ammirato和Patrick Poirson提供有用的意见。
|
129 | Patrick | [ˈpætrik] |
- We also thank Philip Ammirato and Patrick Poirson for helpful comments. 我们也感谢Philip Ammirato和Patrick Poirson提供有用的意见。
|
130 | Poirson | |
- We also thank Philip Ammirato and Patrick Poirson for helpful comments. 我们也感谢Philip Ammirato和Patrick Poirson提供有用的意见。
|
131 | NSF | [!≈ en es ef] |
- We thank NVIDIA for providing GPUs and acknowledge support from NSF 1452851, 1446631, 1526367, 1533771. 我们感谢NVIDIA提供的GPU,并对NSF 1452851,1446631,1526367,1533771的支持表示感谢。
|